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ABSTRACT 

Many database applications deploy hundreds or thousands of in- 
dexes to speed up query execution. Despite a plethora of prior work 
on index selection, designing and deploying indexes remains a dif- 
ficult task for database administrators. First, real-world businesses 
often require online index deployment, and the traditional off-line 
approach to index selection ignores intermediate workload perfor- 
mance during index deployment. Second, recent work on on-line 
index selection does not address effects of complex interactions that 
manifest during index deployment. 

In this paper, we propose a new approach that incorporates tran- 
sitional design performance into the overall problem of physical 
database design. We call our approach Incremental Database De- 
sign. As the first step in this direction, we study the problem of 
ordering index deployment. The benefits of a good index deploy- 
ment order are twofold: (1) a prompt query runtime improvement 
and (2) a reduced total time to deploy the design. Finding an effec- 
tive deployment order is difficult due to complex index interaction 
and a factorial number of possible solutions. 

We formulate a mathematical model to represent the index order- 
ing problem and demonstrate that Constraint Programming (CP) is 
a more efficient solution compared to other methods such as mixed 
integer programming and A* search. In addition to exact search 
techniques, we also study local search algorithms that make signif- 
icant improvements over a greedy solution with minimal computa- 
tional overhead. 

Our empirical analysis using the TPC-H dataset shows that our 
pruning techniques can reduce the size of the search space by many 
orders of magnitude. Using the TPC-DS dataset, we verify that our 
local search algorithm is a highly scalable and stable method for 
quickly finding the best known solutions. 

1. INTRODUCTION 

The selection and deployment of indexes has always been one of 
the most important roles of database administrators (DBAs). Both 
industry and academia have intensively focused their study on the 
automatic selection of indexes in physical database design 1 1 , 24]. 
Every modern commercial database management system (DBMS) 
ships an automatic design tool as its key component. These design 
tools support DBAs by suggesting sets of indexes that dramatically 
improve query execution. 

However, recent software mandates complex data processing over 
hundreds or thousands of tables, making the selection of an appro- 
priate set of indexes impossible for human DBAs and extremely 
challenging for automated tools. Furthermore, modern enterprises 
demand operational business intelligence, where always-on data 
warehouses support complex analytic queries over continuously 
evolving datasets. Last but not least, the queries, data, and even 



schema in very large data-warehouses are continually evolving. 
The main reason for this is two fold. 

• Several iterations are often required to accurately translate 
business requirements into database schema. 

• Businesses dynamically change their requirements. As a re- 
sult, they have to continuously collect and analyze new kinds 
of data for timely decisions. 

This problem has been studied as schema evolution [4| mainly for 
logical table schema designs. For large data-warehouses, frequently 
running the off-line tools and deploying all the suggested indexes 
is impractical. 

One emerging approach for solving this problem is the online- 
index selection (5||19|. The main idea is to keep monitoring the 
queries on the database to deploy (or drop) appropriate indexes 
when it sees a shift in query workload. The online approach can 
quickly react to the change in the database. Furthermore, the se- 
quence of small deployments will adaptively lead to an optimized 
state of the data-warehouse over time. 

Although the online approach is a great step towards optimiza- 
tion for dynamically shifting workloads, it has limitations too. By 
its nature, the online index selection approach selects a single or 
a small number of indexes at a time. If it is necessary to deploy 
several indexes on related tables together to speed-up queries, the 
approach is not likely to select them, yielding in local optima. This 
problem arises because, not only query workloads, but also the log- 
ical table schema is changing. Even a small change in business re- 
quirements sometimes requires drastically different queries as well 
as logical and physical design. 

For example, imagine a popular online digital music shop, z'Zunes 
Store. Hundreds of millions of customers are registered in a table 
CUSTOMER (CUSTID, NAME, ADDRESS, COUNTRY, ...). The 
table is currently clustered by its dimensional attribute COUNTRY 
because the company's analysts' roll-up reports are categorized by 
the customers' countries of residence. The company has received 
an outpouring of complaints from customers that it is quite incon- 
venient that they need to create and switch between multiple ac- 
counts to buy music from localized versions of iZunes Store in dif- 
ferent countries. Thus, the company decided to tie each customer 
to multiple countries. To accomodate this small change, the logical 
database schema evolved to add a new n;n table 
CUST_COUNTRIES(CUSTID, COUNTRY), eliminating COUN- 
TRY from CUSTOMER. 

This schema change requires the analysts to modify many, per- 
haps all, of their reports. Moreover, with regard to the physical 
schema, all clustered and secondary indexes on CUSTOMER must 
be drastically re-designed, as well as the materialized views joining 
the table with related dimensions. 

An online index approach cannot capture the impact of such a 



change. The common approach in online index selection is to pre- 
compute a set of potentially beneficial indexes and only re-evaluate 
their benefits for given representative queries (5||19|. This method 
does not work well in the aforementioned situation because an en- 
tirely different set of indexes must be considered. 

Further, their selection algorithm often does not consider com- 
plex interactions between indexes. To process complex multi-join 
analytic queries, it is often required to deploy more than 10 indexes 
simultaneously. There are also interactions that speed-up index 
creation, which require optimizing the order of index deployment 
studied in this paper. Exploiting the complex index interactions re- 
quires a detailed analysis over millions of queries and thousands of 
candidate indexes, which is impossible in on-line design tools. 

The root problem is that their selection algorithm must be as 
low-overhead as possible in order to continuously monitor query 
workloads and quickly react to a shift of workload. 

1.1 Incremental Database Design 

Large #Queries, Mndexes Small 
Once Tuning Frequency High 
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Figure 1: Incremental Database Design 

Motivated by the observations above, we have started exploring 
a new approach positioned between the two extremes; off-line and 
on-line as illustrated in Figure [T] Our target is a very large data- 
warehouse which needs a drastic change in its physical design. It 
will frequently need design changes, so it is necessary to consider 
not only the query runtime but the deployment time of suggested in- 
dexes to incrementally evolve along with the business requirement. 
On the other hand, the change is relatively less frequent (e.g., a 
week) than what on-line index approach is targeting (e.g., minutes 
or hours). This allows us to employ more sophisticated analysis on 
the choice of indexes and their deployment schedule. 

We call this new type of database design tool as Incremental 
Database Design (IDD) and are studying its requirements, design 
and implementation as a long term project. 

One interesting use case of IDD is the real-time recovery. Nowa- 
days, it is becoming common to deploy a large data-warehouse over 
a number of commodity machines. In such a system, a node failure 
necessitates recovering the part of indexes and materialized views 
stored in the node. In this case, the DBA can use an IDD tool to 
complement the performance degradation because of the lack of 
indexes as soon as possible. 

The first challenge towards this direction, which this paper mainly 
studies, is how to schedule the deployment of indexes to quickly 
complete the deployment or achieve the majority of query speed- 
ups. As mentioned at the beginning of this section, the deployment 
of indexes is an important aspect of database maintenance. Deploy- 
ing indexes is a very costly operation and DBAs give it as much 
care and attention as possible. It consumes immense hardware re- 
sources and takes a long time to complete on large tables. 

Moreover, it is likely that a database requires hundreds of in- 
dexes to be deployed due to the growing number and complexity 
of queries and schema. For example, a database designer built in a 
commercial DBMS suggested 148 indexes for the TPC-DS bench- 
mark which took more than 24 hours to be deployed in the DBMS 
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Figure 2: Index Deployment Orders: Good vs. Bad 

even with the smallest (Scale- 100) instance. 

1.2 Index Deployment Order 

We observed that during the long process of deploying many in- 
dexes over large databases, the order (sequence) of index deploy- 
ment has two significant impacts on user benefits, illustrated in Fig- 
ure]^] First, a good order achieves prompt query runtime improve- 
ments by deploying indexes that yield greater query speed-ups in 
early steps. For example, an index that is useful for many queries 
should be created first. 

Second, a good order reduces the deployment time by allowing 
indexes to utilize previously built indexes to speed up their deploy- 
ment. For instance, the index i\ (LANG, REGION) should be made 
after the wider index i 2 (LANG, AGE, REGION) to allow building 
from the index, not the table. We observe in the TPC-DS case that 
a good deployment order can reduce the build cost of an index up 
to 80% and the entire deployment time as much as 20%. 

Despite the potential benefits, obtaining the optimal index or- 
der is challenging. Unlike typical job sequencing problems J3l, 
both the benefit and the build cost of an index are dependent on 
the previously built indexes because of index interactions. These 
database specific properties make the problem non-linear and much 
harder to solve. Also, as there are n! orderings of n indexes, a triv- 
ial exhaustive search is intractable, even for small problems. 

One prevalent approach for optimization problems is to quickly 
choose a solution by a greedy heuristic. However, the quality of 
a greedy approach can vary from problem to problem and has no 
quality guarantee. Another popular approach is to employ exact 
search algorithms such as A* or mixed integer programming (MIP) 
using the branch-bound (BB) method to prune the search space. 
However, the non-linear properties of the index interactions yield 
poor linear relaxations for the BB method and both MIP and A* 
degenerate to an exhaustive search without pruning. 

2. OVERVIEW 

In this paper, we formally define the ordering problem as a math- 
ematical model and propose several pruning techniques not based 
on linear relaxation but on the combinatorial properties of the prob- 
lem. We show that these problem specific combinatorial properties 
can reduce the size of the search space by many orders of mag- 
nitude. We solve the problem using several techniques including, 
Constraint Programming (CP) and MIP, and show that this kind 
of problem is easiest to model and has better performance in a 
CP framework. We then extend the CP model using local search 
methods to get high quality solutions very quickly for larger prob- 
lems. We evaluate several local search methods and devise a vari- 
able neighborhood search (VNS) method building on our CP model 
that is highly scalable and stable. 

Figure p] gives an overview of our CP-based solution for the in- 
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Figure 3: Solution Overview 

dex ordering problem. Given a query workload, we first run a phys- 
ical database design tool to obtain a set of suggested indexes. Then, 
we analyze the indexes. To avoid actually creating indexes, we use 
what-if |7| interface of the DBMS to hypothetically create each in- 
dex and evaluate its benefits using the query optimizer. The result is 
a matrix which stores the benefits and creation costs of all indexes 
as well as the interactions between them. When formulating this 
matrix as CP code, we also derive additional constraints to speed 
up the CP solver. The CP/LNS engine then solves the problem and 
produces the optimized index deployment order. 

To explain each piece of the solution, this paper is organized as 
follows. Section[3]reviews the related work. SectionEJformally de- 
fines the problem of index deployment. SectionBlprovides several 
techniques to efficiently solve the problem. Section[6]describes our 
CP model for the problem. Section [7] extends the CP model with 
local search to solve larger problems. Section[8]reports the experi- 
mental results and Section|9]concludes this paper and discusses our 
next step towards incremental database design based on this work. 

In summary, our contributions are: 

• Vision of incremental database design 

• A formal description of the index deployment order problem 

• Problem specific properties to reduce problem difficulty 

• Models and algorithms for Greedy, MIP, CP and Local Search 

• Analysis of various solution techniques and solvers 

• Empirical analysis on TPC-H and TPC-DS. 

To the best of our knowledge, this work is the first to study CP 
methods in the context of physical database design despite its sig- 
nificant potential as an accurate and scalable design method. 

3. RELATED WORK 

3.1 Physical Database Design 

Because of the complexity of query workloads and database me- 
chanics, no human database administrator (DBA) can efficiently se- 
lect a set of database objects (e.g., indexes) subject to resource con- 
straints (e.g., storage size) to improve query performance. Hence, 
significant research effort has been made both in academia and in 
industry to automate the task of physical database design |8 24 1. 

The AutoAdmin project 1 1 1 pioneered this field by implementing 
the what-if method |7 10| which creates a set of potentially ben- 
eficial indexes as hypothetical indexes to evaluate their expected 
benefit by the database's query optimizer. 



Once the benefits of each index are evaluated, the problem of 
database design is essentially a boolean knapsack problem, which 
is NP-hard. The database community has tried various approaches 
to solve this problem. The most common approach is to use greedy 
heuristics based on the benefit of indexes 1 8 1 or on their density [24 1 
(benefit divided by size). However, a greedy algorithm is not as- 
sured to be optimal and could be arbitrarily bad in some cases. 
Hence, some research has explored the use of exact methods such 
as mixed integer programming (MIP) 1 16 18] and A* search |13| . 

Despite the wealth of research in physical database design, the 
problem of optimizing index deployment order has not been stud- 
ied closely. Practically all prior work in this field considers both 
the query workloads and the indexes as a set. One exception is (2) 
which considers a query workload as a sequence, but only con- 
siders dropping and re-creating existing indexes to reduce main- 
tenance overhead. The work in 1 20] had also considered ordered 
deployment, but primarily as a way to greedily speed up queries at 
every step, rather than optimize the overall index deployment se- 
quence. Bruno et al. 1 6 1 mentioned a type of ordering problem as 
an unsolved problem, but their objective does not consider prompt 
query speed-ups. Also, they only suggested to use A* or Dynamic 
Programming and did not solve the problem in |6|. 

3.2 Online Index Selection 

Schnaitter et al. proposed the COLT framework [ 19 1 which pro- 
gressively deploys (or drops) indexes as the current dominant query 
workload changes. Their approach controls the online tuner over- 
head through clustering similar queries in the workload and a (user- 
specified) bound on the number of optimizer calls per tuning iter- 
ation. However, their designs are limited to single-column indexes 
due to the high complexity of the problem. Moreover, to further 
simplify the problem they assume that the benefit of each candi- 
date index is completely independent. In practice, this is rarely a 
realistic assumption, particularly when candidate multi-column in- 
dexes are considered. 

In [5 1, Bruno et al. propose a similar mechanism that tracks 
newly arriving queries, gathering the potential benefit of hypothet- 
ical candidate indexes. Once it appears that the cost of adding a 
new index is justified by the anticipated query runtime improve- 
ment, the new index is introduced into the physical configuration. 
The algorithm proposed in 1 5 1 can add several indexes at once, but 
it does not choose a particular deployment order. It is also aware of 
possible index interactions, but uses a rudimentary syntactic esti- 
mate based on column overlap between indexes (again, due to high 
problem complexity and potential algorithm overhead). 

The work in 1 20 1 presented a framework for detecting and eval- 
uating the relative degree of index interaction as it affects query 
performance. The authors have suggested using a visualization 
mechanism to assist the DBA decisions by identifying which of the 
candidate indexes have strong interactions. They have furthermore 
proposed an index deployment utility function that is very similar 
to the one we describe in Section |4~T1 However, their solution to in- 
dex deployment problem is ultimately a greedy selection of indexes 
from the set chosen by the DBA. Although they propose using dy- 
namic programming to achieve a better deployment ordering, that 
approach has a number of shortcomings, such as failing to account 
for the cost to build each index and the way index interaction affects 
this cost. 

Although some have explored the problem of index benefit in- 
teraction 1 5 20 1 in their work, they chose to approximate the index 
interaction benefit to avoid invoking the query optimizer too fre- 
quently. They made this choice in order to contain the cost of the 
online algorithms and in order to quickly respond to shifts in user 



workload. While such approach allows for agile database tuning, 
it tends to deploy very few indexes at a time. Thus it effectively 
ignores the problem of order of index deployment, instead always 
choosing the best one (or best few) indexes at each deployment it- 
eration. Furthermore, to our knowledge no one has yet considered 
incorporating the effects of interactions between indexes as they 
affect the cost of index building itself. As we explain in Section 
|4.2| such interaction can have a significant effect on the overall 
index deployment cost. In this work, we use the exact query opti- 
mizer cost estimates to evaluate index interaction and consider the 
potential effects on the cost of building the indexes as they are de- 
ployed. We use CP (as a superior alternative to a greedy or MIP 
approach) and incorporate a number of carefully defined index in- 
teraction rules (see Section |4~2| ( to find a good index deployment 
order. 

In this work, we assume that creation of a single index is an 
atomic process as that is the default DBMSs behavior. A alternative 
approach explored in [14||17| is to build the index piece-by-piece. 
The work in 1 17 1 explores the idea of building the index concur- 
rently with table updates. They also propose the idea of querying 
the incomplete index, provided the query can be answered using 
the part of the index that was already built. A similar idea 1 14], 
explored in the context of a column store DBMS, is to copy and 
reorganize the data content as the queries access it. This approach 
provides more immediate adaptation to the changes in query work- 
load and can also recycle the work already performed by the query. 
In this paper, however, we do not assume these advanced function- 
alities built in the DBMS. 

3.3 Branch-and-Bound 

All decision problems, such as the index order problem, can be 
formulated as tree search problems. Such a tree has one level for 
each decision that must be made and every path from the root node 
to a leaf node represents one solution to the problem. In this way, 
the tree compactly represents all the possible problem solutions. 
However, exploring this entire tree is no more tractable than ex- 
haustive search. Therefore, many tree search techniques have been 
developed to more efficiently explore the decision tree. 

Branch-and-Bound (BB) is a tree search method which prunes 
(a.k.a. removes) sub-trees by comparing a lower bound (best possi- 
ble solution quality) with the current best solution. A* is a popular 
type of BB search method which uses a user-defined heuristic dis- 
tance function to deduce lower bounds. 

MIP solvers, such as IBM ILOG CPlex, are also based on BB. 
MIP uses a linear relaxation of the problem to deduce lower bounds, 
and the pruning power of the MIP is highly dependent on the tight- 
ness of the linear relaxation. 

BB is efficient when the relaxation is strong, however it degrades 
as the relaxation becomes weaker, which is often the case for non- 
linear problems. Also, MIP only supports linear constraints, and 
it is tedious to model non-linear properties using only linear con- 
straints. 

3.4 Constraint Programming 

Similar to MIP, Constraint Programming (CP) does a tree search 
over the values of the decision variables. Given a model, a CP 
solver explores the search tree like a MIP solver would. However, 
there are a few key differences summarized in TablefT] 

First, CP uses a branch and prune (BP) approach instead of BB. 
At each node of the tree, the CP engine uses the combinatorial prop- 
erties of the model's constraints to deduce which branches cannot 
yield a higher quality solution. Because the constraints apply over 
the combinatorial properties of the problem, the CP engine is well 



Table 1: MIP and CP Comparison 
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suited for problems with integer decision variables. Instead of a 
linear relaxation to guide the search procedure in MIP, CP models 
often include specialized search strategies that are designed on a 
problem-by-problem basis 1 3 1. 

Second, CP does not suffer from the restriction of linearity that 
MIP models have. This is especially helpful for our problem which 
has a non-linear objective function and constraints such as nested 
decision variable indexing. 

Third, CP models allow a seamless extension to local search. 
When the problem size becomes so large that proving a solution's 
optimality is impossible, the goal becomes getting a near-optimal 
solution as fast as possible. In this setting, global search techniques 
(such as MIP and CP) often become impractical because they ex- 
haustively search over every sub-tree that has some chance of con- 
taining the optimal solution regardless of how slight the chance is, 
and how large the sub-tree is. Such exact methods are thus in- 
appropriate to quickly find high quality solutions. On the other 
hand, local search on top of CP such as Large Neighborhood Search 
(LNS) 1 23 J combines the pruning power of CP with the scalability 
of local search. 

In later sections, we will contrast these differences more vividly 
with concrete case studies for modeling and solving the index order 
problem. Although we find that CP is highly effective for physical 
database design, to the best of our knowledge this is the first time 
that CP has been applied to this problem domain. 

4. PROBLEM DEFINITION 

This section formally defines the index deployment order prob- 
lem. Throughout this section, we use the symbols, constant values, 
and decision variables listed in Table [2] and [3] Please note that 
although we refer to indexes throughout this paper, any auxiliary 
database structure that speeds up query performance (e.g., MV) can 
be trivially incorporated into our formulation. 

4.1 Objective Values 

Every feasible solution to the problem is a permutation of the 
indexes. An example permutation of indexes {ii,i2,*3} is is — > 
z'i — > i'2. As discussed in the introduction, we want to achieve a 
prompt query runtime improvement and a reduction in total 
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Table 2: Symbols & Constant Values (in lower letters) 



i£l 


An index. I = {ii,i2, .., im} 


q eQ 


A query. 
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A query plan (a set of indexes). 
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Feasible query plans for query q. 


qtime(q) 


Original runtime of query q. 


qspdup(p, q) 


Speed-up of using plan p for query q 
compared to the original runtime of q. 


ctime(i) 


Original creation cost of index i. 


cspdup(i, j) 


Speed-up of using index j for index i. 



deployment time, the smaller the area the better the solution. Thus, 
this objective function considers prompt query speed-ups and total 
deployment time simultaneously. 

4.2 Index Interactions 

This section describes the various index interactions, which 
make the problem unique and challenging. 

Competing Interactions: Unlike typical job sequencing prob- 
lems, completing a job (i.e. building an index) in this problem has 
varying benefits depending on the completion time of the job. 

This is because a DBMS can only use one query execution plan 
at a time. Consider the indexes i\{City) and ii(City, Salary) 
from the following query: 
SELECT AVG (Salary) FROM People 
WHERE City=Prov 

Assume the query plan using i% is 5 seconds faster than a full scan 
while the plan using the covering index 12 is 20 seconds faster. 

The sequence i 1 — > «2 would have a 5 second speed-up when i 1 
is built, and only 20 — 5 = 15 second speed-up when 12 is built be- 
cause the query optimizer in the DBMS picks the fastest query plan 
possible at a given time, removing the benefits of suboptimal query 
plans. Likewise, the sequence £2 — > ii would observe no speed-up 
when ii is built. We call this property competing interactions and 
generalize them by constraint[3]in the mathematical model. 

Query Interactions: It is well known that two or more indexes 
together can speed up query execution much more than each index 
alone. Suppose we have two indexes ii(City) and i2(ErapID) 
for the following query: 

SELECT . . FROM People pi JOIN People p2 
ON (pi .ReportTo=p2 . EmpID) WHERE pl.City=Prov 
A query plan using one index ({ii } and {12}) requires a table scan 
for the JOIN and costs as much as the no-index plan { } . A query 
plan using both i\ and 22 ({ii, *2}) avoids the full table scan and 
performs significantly faster. We call such index interactions query 
interactions. Because of such interactions, we need to consider the 
speed-ups of the three query plans separately, rather than simply 
summing up the benefits of singleton query plans. 

Build Interactions: As a less well known interaction, some in- 
dexes can be built faster if there exists another index that has some 
overlap with the keys or included columns of the index to be built. 

For example, ii(City) and i%{Gity, Salary) have interactions 
in both ways. If 12 already exists, building i\ becomes substan- 
tially faster because it requires only an index scan on i\ rather than 
scanning the entire table. On the other hand, if there already is i\, 
building 12 is also faster because the DBMS does not have to sort 
the entire table. We call these index interactions build interactions 
and generalize it by constraintBlin the mathematical model. 

This means that the index build cost is not a constant in our prob- 
lem but a variable whose value depends on the set of indexes al- 
ready built. Bruno et al. |6| also mentioned this effect earlier. In 
Section[8]we show there exist a rich set of such interactions. 



Table 2 


: Decision Variables (in capital letters) 


Ti €{1,..|/|} 


The position of index i in the deployment order. 
T is a permutation of {1, 2, .., \I }. 


Ri 


Total query runtime after i th index is built. 


Xq,i 


q's speed-up after i th index is built. 


Y P ,i G {0, 1} 


Whether p is available after i th index is built. 


Ci 


Cost to create i th index. 



Precedence: Sometimes, an index must precede some other in- 
dexes. One example is an index on a materialized view (MV). A 
MV in a certain type of DBMS is created when its clustered index 
is built. Non-clustered (secondary) indexes on the MV cannot be 
built before the clustered index. Hence, the clustered index must 
precede the secondary indexes on the same MV in a feasible solu- 
tion for such a DBMS. 

Another example is a secondary index that exploits correlation 
|15| . For example, SQL Server supports the datetime correlation 
optimization which exploits correlations between clustered and sec- 
ondary datetime attributes. To work properly, such an index re- 
quires the corresponding clustered index to be built first. 

Detection: Some prior work explored a way to efficiently find 
such interacting indexes [20|. In our experiments, we detect inter- 
actions by calling the query optimizer with hypothetical indexes as 
detailed in Section[8] 

4.3 Mathematical Model 

Embodying the concepts of index interactions discussed above, 
the full mathematical model is defined as follows, 

Objective: miny^ (Ri-iCi) (1) 

i 

Subject to: Y v . % = {T 3 < i : Vj £ p} : Vp, i (2) 

X q> i — max qspdup(p, q)Y P: i : Vg, i (3) 

p<zplans(q) 

Ri=Y^ (l time q ~ X q.i) ■ Vi ( 4 ) 

1 
Ct { — ctime(i) — max cspdup(i,j) : Vi (5) 

j:Tj<Ti 

pl states that a query plan is available only when all of the 
indexes in the query plan are available. |3l calculates the query 
speed-up by using the fastest query plan for the query at a given 
time, fib sums up the speed-ups of each query and subtract from 
the original query runtime to get the current total runtime, pi cal- 
culates the cost to create index i (pTi because C is indexed by 
the order) by considering the fastest available (Tj < Ti) interac- 
tion. For simplicity, this constraint assumes every build interaction 
is pair-wise (one index helps one other index). So far we have ob- 
served this to be the case, but this constraint can easily be extended 
for arbitrary interactions by doing a similar formulation using X 
and Y variables. 

Given this mathematical formulation, our goal is to find the per- 
mutation with the minimal objective value and prove its optimal- 
ity. However, for large problems where an optimality proof is 
intractable, we are satisfied with any solution that can be found 
quickly and makes a significant improvement over a greedy solu- 
tion technique. 

4.4 Discussion 

In formalizing a problem as rich as the index deployment order 
problem there are many choices to be made. One option is to sim- 
plify the problem to have some nice theoretical properties, such as 
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good approximation algorithms and tight lower bounds. Another 
approach is to include as much sophistication as possible in the 
problem formulation so that it might be deployable in industrial ap- 
plications. By choosing to include all of the index interactions and 
a complex objective function, this work has chosen the later op- 
tion. In doing so, tight lower bounds and theoretical guarantees are 
outside the scope of this formulation. Hopefully this short coming 
is balanced by broader industrial applications. In fact, the exper- 
imental results demonstrate that index interactions are an impor- 
tant consideration to this problem and removing them would have 
a significant effect on solution quality. Recognizing that theoretical 
guarantees are out of reach, this work will conduct a rigorous ex- 
perimental study to understand the performance of several solution 
techniques for the index deployment order problem, and focus on 
the scale of problems that are necessary for industrial deployment. 
The objective function is another area of many choices. For ex- 
ample, putting different weights on particular queries can be incor- 
porated by simply scaling up or down runtimes of the queries. Or, 
one can consider minimizing the total deployment time, ~}2 C\, like 
J6). In either case, most of the modeling and pruning strategies in 
this paper will be usable with minor modifications. 

5. PROBLEM PROPERTIES 

This problem has up to |/|! possible solutions. An exhaustive 
search method that tests all the solutions is intractable even for 
small problems. In this section, we analyze the combinatorial prop- 
erties of the problem. Based on the problem specific structure, such 
as index interactions, we established a rich set of pruning tech- 
niques which significantly reduce the search space. This section 
describes the intuition behind each optimization technique and how 
we apply it to the problem formulation. The formal proofs and cost 
analysis of each technique can be found in Appendix [D] 

These techniques are inherent properties of the problem which 
are independent of a particular solution procedure. In fact, we 
demonstrate that these techniques reduce the runtime of both MIP 
and CP solvers by several orders of magnitude in Section[8] 

5.1 Alliances 

The first problem property is an alliance of indexes that are al- 
ways used together. We can assume that such a set of indexes are 
always created together. 

FigureBlexemplifies alliances of indexes. The figure illustrates 4 
query plans with 6 indexes; (11,13), {ii, 13,65}, {12,25}, {14, ie}. 
Observe that i\ and 13 always appear together in all query plans 
they participate in. Therefore, creating only one of them gives no 
speed-up for any query. This means we should always create the 
two indexes together. Hence, we add a constraint Ti 1 = Ti 3 + 1. 
Same to 14 and i§. Note that 12 and 15 are not an alliance because 
is appears in the query plan {11,13,15} without 12- An alliance 
is often a set of strongly interacting indexes each of which is not 
beneficial by itself. An alliance of size n essentially removes n — 1 
indexes and substantially simplifies the problem. 

5.2 Colonized Indexes 



The next problem property is a colonized index which is a one- 
directional version of alliances. If all interactions of an index, i, 
contain another index, j but not vice versa, then i is called a colo- 
nized index and should be created after j. 

Figure [6] shows a case where i\ is colonized by 12. ii always 
appears with 12 in all query plans i\ participates, but not vice versa 
because there is a query plan that only contains i 2 . 

In such a case, creating i\ alone always yields no speed-up. On 
the other hand, creating 12 alone might provide a speed-up. Thus, 
it is always better to build the colonizer first; Ti 1 > Ti 2 . 

Observe that i\ is not colonized by 13 or {4 because i\ appears in 
plans where only one of them appears . In fact, if the plan { i\ , 12 , 14 } 
is highly beneficial, the optimal solution is 12 — > 14 — > i\ — > iz, 
so T il > Ti 3 does not hold. Likewise, if the plan {ii, 12,13} is 
highly beneficial, the optimal solution is 12 — > 13 — > i\ — > 14, so 
Ti t > T i4 does not hold. 

5.3 Dominated Indexes 

The next problem property is called a dominated index which is 
an index whose benefits are always lower than benefits of another 
index. Dominated indexes should always be created last. 

To simplify, consider the case where indexes have the same build 
cost and every query plan is used for different queries. For the full 
formulation without these simplifications, see Appendix |D| 

Figure 17] depicts an example where i\ is dominated by j'2. The 
maximum benefit of an index is the largest speed-up we get by 
building the index. For example, the maximum benefit of i\ occurs 
when there already exists 13, which is 1 + 3 = 4 seconds. Con- 
versely, the minimum benefit is the smallest speed-up we get by 
building the index. i\ 's minimum benefit happens when there is no 
13 index; only 1 second. On the other hand, both the maximum and 
minimum benefits of 12 are 5 seconds. 

Hence, the speed-up of building i\ is always lower than the speed- 
up of building 12, As our objective favors a larger speed-up at an 
earlier step, we should always build 12 before ii\Ti 1 > Ti 2 . 

5.4 Disjoint Indexes and Clusters 

The next problem property is called a disjoint index, which is 
an index that has no interaction with other indexes. Such indexes 
do not give or receive any interaction to affect the build time and 
speed-up and sometimes we can deduce powerful constraints from 
them. FigurelHlshows an example of a disjoint index 14 and a dis- 
joint cluster M\ — \i\ , 12 , 13} which has no interaction with other 
indexes except the members of the cluster. 

Suppose we already have a few additional constraints that define 
the relative order of {i\, 12, 13} is i\ — >• 12 — >• iz and we need to 
insert 14 into the order. Among the four possible locations for 14, 
we can uniquely determine the best place, which we call the dip. 

We know the placement of 14 does not affect the build cost and 
the speed-up of any index in Mi because 14 and Mi are disjoint. 
In such a case, we should place 14 after an index whose density (the 
gradient of the diagonal line; speed-up divided by build cost) is 
larger than it's density and before an index with a smaller density. 
Otherwise, we can improve the order by swapping 14 with another 



index because the shaded area in Figure [8] becomes larger when 
we build an index with a smaller density first. In the example, the 
best place is between «2 and is, which means derii 1 +i 2 > derii 4 , 
deiii 2 > derii i and den; 4 > derii 3 where den x is the density of 
x. We call this location, the dip and there is always exactly one dip. 

We can generalize the above technique for non-disjoint indexes 
when they have special properties which we call backward-disjoint 
and forward-disjoint. Consider two disjoint clusters Mi and Mj 
which contain index i and j respectively. In order to determine 
whether i precedes or succeeds j in the complete order, we can 
investigate the interacting indexes of i and j. 

i is said to be backward-disjoint regarding j when all interacting 
indexes of i and j are built after i or before j. Conversely, i is said 
to be forward-disjoint regarding j when all interacting indexes are 
built before i or after j, in other words when j is backward-disjoint 
regarding i. A disjoint index is both backward and forward disjoint 
regarding every other disjoint index. Initially most indexes have no 
disjoint properties, but with the additional constraints from other 
properties they often become backward or forward disjoint. 

An intuitive description of i being backward-disjoint regarding j 
is that i and j behave as disjoint indexes when we are considering 
a subsequence j — > X — >• i for arbitrary X, so i is disjoint in a 
backwards order. Because of the property of disjoint indexes, the 
subsequence must satisfy derii < derij if it is an optimal solution. 
Thus, if we know dent > derij, we can prune out all solutions that 
build j before i. Conversely, if i is forward-disjoint and derii < 
derij, then i always succeeds j. 

5.5 Tail Indexes 

Because of the inequality constraints given by the above proper- 
ties, sometimes a single index is uniquely determined to be the last 
{tail) index. In that case, we can eliminate the index from the prob- 
lem for two reasons. First, the last index cannot cause any inter- 
action to speed up other indexes either in query time or build time 
because all of them precede the last index. Second, the interactions 
the last index receives from other preceding indexes do not depend 
on the order of other indexes; all the other indexes are already built. 
Therefore, we can remove the last index and all of its interactions 
from consideration, substantially simplifying the problem. 

We can extend this idea even if there are multiple candidates for 
the last index by analyzing the possible tail index patterns. 

For example, in the TPC-H problem solved in Section [8T[ i\ and 
22 turn out to have many preceding indexes and thus the possible 
orders of them are n (last), n — 1 (second to last) and n — 2 (third 
to last). All possible patterns of the last 3 tail indexes are listed in 
Figure[9] It also shows the last part of the objective area (tail objec- 
tive) for the 3 tail indexes in each pattern (the shaded areas). We can 
calculate the tail objectives because the set of preceding indexes is 
known therefore, regardless of their orders, their interactions to the 
tail indexes are determined. 

Remember that there are many other preceding indexes before 
the tail indexes. Therefore, we cannot simply compare the tail ob- 
jectives. For example, the tail objective of 12 — ► is — >• ii in Fig- 
ure|9]is smaller than that of 14 — > i\ — >• i<i. However, because the 
set of preceding indexes is different, we cannot tell if the former 
tail pattern is better than the latter. 

Nevertheless, we can compare the tail objectives if the set of 
tail indexes is equivalent. 24 — > %\ — > 12 and i\ — \ 24 — > 22 
contain the same set of indexes, thus the set of preceding indexes 
is the same too, which means the objective areas and the order of 
preceding indexes is exactly the same after we optimize the order 
of preceding indexes (again, the tail indexes do not affect preceding 
indexes). Hence, we can determine which tail pattern is better by 
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Figure 8: Disjoint Indexes and Disjoint Clusters 

comparing tail objectives. 

Notice that the tail patterns in Figure [9] are grouped by the set 
of tail indexes and also sorted by the tail objectives in each group. 
The ones with the smallest tail objective in each group are called 
the champion of the group and they should be picked if the set of 
indexes are the tails. 

Now, observe that 12 appears as the last index in every champion 
(in bold font) of all groups. This means 12 is always the last created 
index in the optimal deployment order because its tail is always one 
of the tail champions. 

5.6 Iterate and Recurse 

We can repeat the tail analysis by fixing 12 as the last index and 
considering a sub-problem without 12- Not surprisingly, we could 
then uniquely identify i\ as the second-to-last index. 

Furthermore, by removing the determined indexes (and their query 
plans) and considering the already introduced inequalities, each 
analysis described in this section can apply more constraints. There- 
fore, we repeat this process until we reach the fixed-point. This pre- 
analysis reduces the size of search space dramatically. In the ex- 
perimental section, we demonstrate that the additional constraints 
speed up both CP and MIP by several orders of magnitude. 

6. CONSTRAINT PROGRAMMING 

In this section, we describe how we translate the mathematical 
model given in Section [43] into a Constraint Programming (CP) 
model. We then explain how the problem is solved with a CP 
solver. To illustrate why CP is well suited for this problem, we 
will compare the CP model to that of MIP throughout this section. 

6.1 CP Model 

CP allows a flexible model containing both linear and non-linear 
objectives and constraints. The mathematical formulation presented 
in SectionpOlcan be modeled in standard CP solvers (e.g., COMET) 
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Figure 9: Comparing Tail Indexes of Same Index Set in TPC-H 

almost identically, unlike MIP where the model is more obfuscated 
(an equivalent MIP model is given in Appendix [Bjl. 

Objective: min Y] (R[i - l]G[i]) (6) 

i 

Subject to: alldiffrent(T) (7) 

Y[p,i] = f\(T[j]<i):\/p,i (8) 

X[q, i] — max (qspdup(p, q)Y[p, i]) : Vg, i (9) 

p£plans(q) 

R[i] = J2{qtime(q) - X[q,i]) : Vi (10) 

9 

C\T[i\] = ctime(i) - max((T[j] < T[i])cspdup(i,j)) : Vi (11) 

i 

Objective: Just like the mathematical model, our CP model min- 
imizes the sum of R[i — l]C[i\. Although this sounds trivial, MIP 
cannot directly accept a product of variables (R and C) as an ob- 
jective. 

The most common technique for linearizing a product of vari- 
ables in MIP is to discretize the entire span to a fixed number of 
uniform timesteps and define the value of each variable at each 
timestep as an independent variable [21 J. 

However, in addition to losing accuracy, discretization causes 
severe problems in performance and scalability of MIP which are 
verified in the experimental section. 

alldifferent constraint: The variable T is given in JTl which uses 
alldifferent. This interesting constraint in CP assures all the vari- 
ables in T are a permutation of their values. The same constraint in 
MIP would require |7j 2 inequalities on elements of T. The CP en- 
gine represents it with a single constraint which is computationally 
efficient. This is one of the most vivid examples showing that CP 
is especially suited for combinatorial problems and how beneficial 
it is for modeling and optimization purposes. 

Logical AND: The AND constraints on Y (|2l are translated di- 
rectly into (|8J. Although this sounds trivial, again, it is challeng- 
ing in MIP. Logical AND is essentially a product of boolean vari- 
ables, which is non-linear, just as the objective was. Modeling such 
non-linear constraints causes MIP additional overhead and memory 
consumption as well as model obfuscation. 

MIN/MAX sub-problem: The constraints on X ([3l which em- 
ploy the fastest available speed-up for each query are translated 
directly into (9l. Yet again, this is not easy nor efficient in MIP 
because MIN/MAX is non-linear. 

In MIP, this has to be represented as summation of Y and qspdup 
where only one of Y for each query takes the value of 1 at a given 
time. Some MIP solvers provide min/max constraint and internally 
do this translation on behalf of users, but the more severe problem 
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is its effect on performance. When MIP considers the linear relax- 
ation of X, min/max constraint yields little insight. Hence, its BB 
degenerates to an exhaustive search. 

Nested variable indexing: The constraints on C |5|( are trans- 
lated directly into \l 1[>. However, this causes two problems in MIP. 
One is the MIN/MAX as described above, another is the nested 
variable indexing Cri ■ Notice that T is also a variable. Such a con- 
straint cannot be represented in a linear equation. Hence, MIP has 
to change the semantics of the variable C itself and re-formulate 
the all of the constraints and the objective calculation. 

Additional constraints: Finally, we add the additional constraints 
developed in SectionBlto reduce the search space. 

6.2 Searching Strategy 

CP employs branch-prune (BP) instead of BB used by MIP. These 
two approaches have very different characteristics. In summary, CP 
is a white-box approach with a smaller footprint as opposed to the 
black-box approach of MIP. 

Pruning: CP is able to prune the search space by reasoning over 
the combinatorial properties of the constraints presented in sec- 
tion [6T] It also utilizes the problem specific constraints we de- 
veloped in Section B] to efficiently explore only high quality index 
orders. Our experimental results demonstrate that combinatorial 
based pruning is much more effective for this problem than a BB 
pruning based on a linear relaxation. 

Branching: Users can and must specify how CP should explore 
the search space. In our case, we found that it is most effective for 
the search to branch on the T[i] variables and that a First-Fail (FF) 
search procedure was very effective for solving this problem and 
proving optimality with very small memory footprint. 

A FF search is a depth-first search using a dynamic variable or- 
dering, which means the variable ordering changes in each node of 
the search tree. At each node the variables are assigned by increas- 
ing the domain size. Due to the additional constraints, the domains 
of the T[i] variables vary significantly. This helps the FF heuristic 
to obtain optimality. 

On the other hand, MIP automatically chooses the branching 
strategy. This is efficient when the linear relaxation is strong, but, 
when it is not, the BB search degenerates to an exhaustive breadth- 
first search which causes large memory consumption and compu- 
tational overhead. In fact, we observe that MIP finds no feasible 
solution for large problems within several hours and quickly runs 
out of memory. 

7. LOCAL SEARCH 

Although CP is well suited for this ordering problem, when there 
is a large number of indexes with dense interactions between them, 
proving optimality is intractable. In such a case, our goal is to find 
a high quality solution quickly. 

The simplest approach is to keep running exact search algorithms 
until some time limit and report the best solution. In fact, this is 



the standard method in MIR However, such an approach is often 
impractical to find good solutions within a short time period as de- 
scribed in Section [3~4| On the other hand, the probability of finding 
a good solution with a simple random sampling is too small for 
large problems because of the factorial number of possible order- 
ings. One of the advantages of CP is that a CP formulation can be 
effortlessly extended to Local Search which addresses these prob- 
lems. 

Local search is a family of algorithms for quickly finding high 
quality solutions. There are many possible local search meta-heuristics 
to choose from such as, Tabu Search (TS) 1 12], Simulated Anneal- 
ing, Ant Colony optimization, Large Neighborhood Search (LNS) 1 23 1, 
and Variable Neighborhood Search (VNS). We consider two TS 
methods, LNS and VNS. TS is a natural choice because it is effec- 
tive on problems with a highly connected neighborhood (such as 
this one, where nearly all index permutations are feasible). We also 
consider LNS and VNS because they are a simple extension of a 
CP formulation and the CP formulation proved to be very effective 
on smaller instance sizes. 

7.1 Tabu Search (TS) 

Tabu Search (TS) is a simple method for performing gradient de- 
scent on the index permutation. At each step, TS considers swap- 
ping a pair of elements in T. To avoid being trapped in local op- 
tima and repeating the same swap, TS maintains a Tabu list. The 
elements recently swapped are considered in probation for some 
number of steps (called Tabu length). During those steps, TS does 
not consider swapping those elements and hopefully escapes local 
optima. 

We implemented and evaluated two Tabu Search methods; TS- 
BSwap {Best-Swap) and TS-FSwap (First-Swap). TS-BSwap con- 
siders swapping all possible pairs of indexes at each iteration ex- 
cept the Tabu list, and takes the pair with the greatest improvement. 
TS-FSwap stops considering swaps when it finds the first pair that 
brings some improvement. 

TS-BSwap will result in better quality while TS-FSwap will be 
more scalable because quadratic time of checking all pairs may take 
considerable time in large problems. 

7.2 Large Neighborhood Search (LNS) 

Figure [10] illustrates how a LNS algorithm executes. A LNS 
algorithm works by taking a feasible solution to an optimization 
problem and relaxing some of the decision variables. A CP search 
is then executed on the relaxed variables while the other variables 
remain fixed. If the CP search is able to assign the relaxed vari- 
ables and improve the objective value, then it becomes the new 
current solution, otherwise the solution is reset and a new set of 
variables are randomly selected for relaxation (restart). Like most 
local search algorithms, this procedure is repeated until a time limit 
is reached. In this way, LNS leverages the power of a CP solver to 
efficiently search a large neighborhood of moves from the current 
best solution. 

The CP model for our LNS algorithm was presented in Section 
|6.1| to complete the picture we need to explain our relaxation strat- 
egy. For simplicity we use a very basic relaxation, 5% of the in- 
dexes are selected uniformly at random for relaxation. A new re- 
laxation is made if one of these two conditions is met; (1) the CP 
solver proves no better solution exists in this relaxation; (2) the CP 
solver has to back track over 500 times during the search (in LNS 
this is called the failure limit). We found this relaxation size and 
failure limit effectively drove the search to a high quality solution. 

7.3 Variable Neighborhood Search (VNS) 



One difficulty of a LNS algorithm is how to set the parameters 
for relaxation size and failure limit. As depicted in Figure [TO] if 
they are set too small it is easy to get stuck in a local minimum. If 
they are too large the performance may degrade to a normal CP ap- 
proach. Furthermore, different problem sizes may prefer different 
parameter settings. Our remedy for this difficulty is to change the 
parameters during search. This technique is well known as Variable 
Neighborhood Search (VNS) (Tl). 

Our VNS approach is to start the search on a small neighborhood 
and inspect the behavior of the CP solver to increase the neighbor- 
hood and escape local minima only when it is necessary. The in- 
tuition is, if the relaxation terminates because the CP solver proves 
there is no better solution, then we are stuck in a local minimum and 
the relaxation size must increase. However, if the CP solver hits the 
failure limit without proof, then we should do more exploration in 
the same size neighborhood, which is achieved by increasing the 
failure limit. Specifically, we group the relaxations into groups of 
20 and if more than 75% of these relaxations were proofs then we 
increase the relaxation size by 1%, otherwise we increase the fail- 
ure limit by 20%. 

In the experimental section, we find this VNS strategy has two 
benefits. First it guides the algorithm to high-quality solutions 
faster than a regular LNS and also consistently found higher qual- 
ity solutions. Second, VNS is highly scalable and stable even for a 
problem with hundreds of indexes, which is not the case with the 
other methods. 

7.4 Greedy Initial Solution 

As described in the introduction, greedy algorithms are scalable 
but have no quality guarantees. Nonetheless, a greedy algorithm 
can provide a great initial solution to start a local search algorithm. 

To that end, we devise a greedy algorithm which gives a much 
better initial solution than starting from a random permutation. The 
key idea of the algorithm is to consider interactions of each index as 
future opportunities to enable a beneficial query plan that requires 
two or more indexes. We greedily choose the index with the high- 
est density (benefit divided by the cost to create the index) at each 
step. Here, the benefit is the query speed-up achieved by adding the 
index plus the potential benefits from interactions. We find query 
plans that contain the index but are not yet feasible because of miss- 
ing indexes, then equally attribute the speed-up of the query plan to 
the missing indexes, dividing the benefit by the count of them. For 
more details and analysis of its quality, see Appendix [C] 

8. EXPERIMENTS 

We implemented our prototype of the index ordering problem 
solver with a popular commercial DBMS and its design tool for the 
experiments. We also used COMET 2.1 as a CP/LNS solver and 
ILOG CPlex 12.2 as a MIP solver. All experiments are done in a 
single machine with a Dual-Core CPU and 2 GB of RAM. CPlex 
automatically parallelized the MIP on the dual core while CP and 
local search in COMET only used one core. 

We use two standard benchmarks as datasets; TPC-H and TPC- 
DS. Table H] shows the size of each dataset. TPC-DS is a major 
revision of TPC-H to reflect the complex query workloads and table 
scheme in real data analysis applications. TPC-DS has many more 
queries, each of which is substantially more complex and requires 
several indexes to efficiently process when compared to TPC-H. 
Hence, the design tool suggested 148 indexes (up to 300 depending 
on configurations of the tool). There is even a query plan that uses 
as many as 13 indexes together. We also found a rich set of index 
interactions in both datasets. 

We detect various query plans and interactions as follows. We 



first call the DBMS's what-if query optimizer with all hypothetical 
indexes suggested by the DBMS's database designer. The query 
optimizer returns the best atomic configuration |I0| . We then re- 
move the hypothetical indexes in the atomic configuration and call 
the optimizer again, getting a sub-optimal atomic configuration. 
We repeat these steps several times for each query. The resulting 
set of atomic configurations are the query plans, from which we 
extract the competing and query interactions. We do the same with 
the queries to create indexes for detecting the build interactions. 

Table 4: Experimental Datasets 
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8.1 Exact Search Results 

We verified the performance of each method to find and prove 
the optimal solution with the TPC-H dataset. 

We compared the performance of MIP and CP methods with and 
without the additional constraints, varying the number of indexes 
(size of the problem). For MIP, we discretized the problem for 
|7| * 20 timesteps. We also varied the density of the problem, low 
density means we remove all suboptimal query plans and build in- 
teractions, mid density means we remove all but one suboptimal 
query plan and build interactions with less than 15% effects. 

As can be seen in Table [5] neither MIP nor CP could solve even 
small problems without problem specific constraints, taking time 
that grows factorially with the number of indexes. By applying 
the problem specific constraints (denoted by + ), both MIP and 
CP were dramatically improved and took less than one minute to 
solve all low-density problems. For higher density problems, they 
took substantially longer because the pruning power of additional 
constraints decreases. MIP suffered more from the higher density 
because it results in more non-linear properties discussed in Sec- 
tion[6] VNS quickly found the optimal solution in all cases. In the 
21 indexes and mid-density problem, VNS found a good solution 
within one minute and did not improve the solution for 3 hours. 
This strongly implies the solution is optimal, but there is no proof 
as the exact search methods did not finish. 

Drill-Down Analysis: Table [6] shows how the additional con- 
straints from each problem property affects the performance of the 
complete search experiment described in Section [8~T1 We start with 
no additional constraint and add each problem property one at a 
time in the following order, Alliances, Colonized-indexes, Min/max- 
domination, Disjoint-clusters, and Tail-indexes. We only used ad- 
ditional constraints we could deduce within one minute, so the 
overhead of pre-analysis is negligible. 

The results demonstrate that each of the five techniques improves 
the performance of the CP search by several orders of magnitude 
without affecting optimality. The runtime of CP without pruning 
is roughly proportional to \I\\. Hence, the total speed-up of the 
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8.2 Local Search Results 

We also studied TPC-H and TPC-DS with all indexes, query 
plans, and interactions. Because of the dense interactions and many 
more indexes, the search space increases considerably. Even CP 
with the problem specific constraints cannot prove optimality for 
this problem and gets stuck in low quality solutions. Hence, we 
used our local search algorithms to understand how to find high 



Table 5: Exact Search (Reduced TPC-H): Time [min]. Varied 
the number and interaction density of indexes. VNS: No opti- 
mality proof. DF: Did not Finish in 12 hours or out-of-memory. 



\I\ 

Density 


6 
low 


11 
low 


13 
low 


22 
low 


31 

low 


16 

mid 


21 
mid 


MIP 


<1 


11 


106 


DF 


DF 


DF 


DF 


CP 


<1 


7 


214 


DF 


DF 


DF 


DF 


MIP+ 


<1 


168 


DF 


CP+ 


<1 


1 


DF 


VNS 


<1 


<1? 



Table 6: Pruning Power Drill-Down (Reduced TPC-H). Time 
[min]. 



\I\ 

Density 


6 
low 


11 
low 


13 
low 


18 
low 


22 
low 


25 
low 


31 
low 


16 
mid 


21 
mid 


CP 


<1 


7 


214 


DF 


DF 


DF 


DF 


DF 


DF 


+A 


<1 


DF 


DF 


DF 


DF 


DF 


DF 


+AC 


<1 


69 


DF 


DF 


DF 


DF 


DF 


+ACM 


<1 


249 


DF 


DF 


DF 


DF 


+ACMD 


<1 


24 


DF 


DF 


DF 


+ACMDT 


<1 


1 


DF 



Table 7: Greedy, Dynamic Programming, and 100 Random 
Permutations for Initial Solutions. (TPC-DS is 400 times larger 
in scale.) 



Dataset 


Greedy 


DP 


Random (AVG) 


Random (MIN) 


TPC-H 


47.9 


57.0 


65.5 


51.5 


TPC-DS 


65.9 


70.5 


74.1 


69.6 



quality solutions to these large problems. 

Limited Scalability of Exact Search: The MIP model suffers 
severely on these large problems and CPlex quickly runs out of 
memory before finding a feasible solution with as much as 4 GB of 
RAM. This is because the denser problem significantly increases 
the number of non-zero constraints and variables, and CPlex can- 
not significantly reduce the problem size in the pre-solving step. 
In fact, over 1 million integer variables remain after pre-solving 
for problems of this size. This result verifies that a linear system 
approach does not scale well for the index ordering problem. 

Although we also tested CP in this and next experiment, CP takes 
a long time to find a solution better than the initial greedy solution 
because it is overwhelmed by a large neighborhood. These results 
demonstrate the need for local search methods in larger problems 
as described in Section|7] 

We then evaluated the performance of local search algorithms 
(TS, LNS, and VNS) described in Section[7]on these problems. All 
the local search methods are implemented in COMET and given the 
same constraints with the same initial solution. 

Algorithm Comparison for Initial Solution: Our local search 
uses the greedy algorithm described in Section |7.4| to come up with 
the initial solution. We compared the quality of the initial solution 
with a Dynamic Programming (DP) algorithm suggested earlier by 
Schnaitter et al (20). Detailed algorithm of our greedy and our 
implementation of the DP algorithm is given in Appendix [C] 

Table [7] shows the objective value of the solutions suggested by 
our greedy, DP, and the average and minimum values of 100 ran- 
dom permutations of indexes. Our greedy solutions are always bet- 
ter than both the average and minimum of random permutations as 
well as than the DP algorithm. 
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Figure 11: Local Search (TPC-H): LNS, VNS and Tabu. (MIP 
runs out memory) 
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Figure 12: Local Search (TPC-DS): VNS and Tabu. (MIP runs 
out memory) 



The main reason our greedy algorithm achieves the better quality 
than the DP algorithm is that the DP algorithm does not consider 
how long building each index will take, assuming all index creation 
costs are uniform. Hence, it often chooses a compact index later 
even if the index has high density (benefit divided by creation cost). 

Another problem in both our greedy and DP is that they do not 
consider build interaction to speed-up deployment time. The re- 
sulting index orders often do not have fast deployment time, which 
is one reason we need to improve the initial solution by the local 
search. 

TPC-H Results: Figure fTTJ shows the quality (y-axis) of solu- 
tions plotted against elapsed search time (x-axis) for the TPC-H 
dataset. The figure compares the LNS, VNS and two Tabu Search 
(TS) methods described in Section[7] 

In this experiment, TS-BSwap achieves a better improvement 
than TS-FSwap because TS-BSwap considers all possible swaps in 
each iteration. VNS is comparable to the two Tabu methods while 
the original form of LNS takes a long time to improve the solution 
because it cannot dynamically adjust the size of its neighborhood. 
We also observed that VNS is more stable than LNS in that it has 
less variance of solution quality between runs. 

TPC-DS Results: Figure [H] compares VNS with Tabu Search 
for the TPC-DS dataset. This time, the improvement of TS-BSwap 
is large but very slow because it takes a very long time (50 minutes) 
for each iteration to evaluate ( 2 ) swaps. VNS achieves the best 
improvement over all time ranges, followed by TS-FSwap. VNS 
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Figure 13: VNS (TPC-DS): Deployment Time and Average 
Query Runtime. 

quickly improves the solution, especially at the first 15 minutes. 
Considering that deploying the 148 indexes on the Scale-100 in- 
stance takes one day, VNS achieves a high quality solution within 
a reasonable analysis time. 

Figure [T3] plots the index deployment time and average query 
runtime during the deployment period to analyze where the im- 
provements of VNS comes from at each time range. The sharp 
improvement at the beginning (15 minutes) of Figure [T2| is mainly 
attributed to the improvement on deployment times by exploiting 
build interactions between indexes. After that, VNS mainly im- 
proves the average query runtime by deploying a set of indexes that 
have significant speed-ups at early steps. 

8.3 Discussions 

Scalability and Robustness: The result shows that VNS is a 
scalable and robust local search method which quickly finds high 
quality solutions in all cases tested. The main reason the TS meth- 
ods sometimes do not work well is essentially the same as why the 
LNS with fixed parameters does not perform well. The neighbor- 
hood size is fixed and it may be too large with TS-BSwap or too 
small with TS-FSwap. 

It is possible to devise a hybrid Tabu method that dynamically 
adjusts the tuning parameters (the number of pairs to check, Tabu 
length, etc) for the problem, but VNS has another important prop- 
erty for avoiding local optima. As VNS relaxes more than two 
variables at each iteration, it can explore multi-swap neighborhoods 
that are necessary to influence large sets of interacting indexes. 

Applicability to Database Design Tools: Because of the scal- 
ability and robustness, VNS on top of CP formulation is highly 
promising to physical database design problems in general such as 
index selection. 

Although the database community has made several efforts to- 
wards MIP and BIP (Boolean Integer Programming) for physical 
database design tools [916 18|, none of commercial tools has em- 
ployed those methods so far. 

One of the vendors told the authors that the main reason to stick 
with greedy algorithm is its scalability for substantially large and 
complex query workloads in the real world up to millions of distinct 
queries. As a commercial tool, it is unacceptable even for such 
huge problems to expose too long runtime (e.g., days to suggest the 
first design) or too unstable quality (e.g., missing indexes that are 
crucially important) when terminated earlier. 

Unlike integer programming, CP formulation achieves the scal- 
ability with robust solution quality by starting from greedy algo- 
rithm and quickly improving it with VNS. Hence, we consider CP 
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and local search as the primary approach for our next step towards 
a database design tool that incrementally optimizes databases. 

9. CONCLUSION AND FUTURE WORK 

In this paper, we proposed our vision towards a physical database 
design tool for large databases to accommodate frequent and dras- 
tic changes in query workloads, logical and physical table schema. 
We call our new design approach as Incremental Database Design 
which differs from both the traditional off-line design tools and on- 
line index selection approaches. The key requirements is to mini- 
mize administrative costs to repeatedly tune large data-warehouses 
without sacrificing query performance improvements. 

As the first step, we defined and solved the optimization problem 
of index deployment ordering. We formalized the problem using a 
mathematical model and studied several problem specific proper- 
ties which increase performance of industrial optimization tools by 
several orders of magnitude. We developed several approaches for 
solving the problem including, a greedy algorithm, CP formulation, 
MIP formulation, and four local search methods. We demonstrated 
that this problem is best solved by a CP framework and found that 
our VNS local search method is robust, scalable, and quickly finds 
high quality solutions on very large problems. 

Our next step is to jointly solve the index selection problem and 
index deployment ordering problem. We are currently working on 
an integrated solution that accounts for the index deployment order- 
ing while choosing a set of indexes to build. The main challenges 
are two fold. First, as we described in this paper, scheduling an op- 
timal deployment order for a single given set of indexes is already 
an expensive analysis. It is obviously impractical to consider the 
order of indexes for every candidate design. Second, now that we 
include in our design tool the deployment time and how quickly 
users will see the query speed-up, we need to provide flexible yet 
easy-to-understand interfaces to let DBAs state their requirements 
in this multi-objective optimization problem. We will tackle these 
issues based on our prior physical database design tool 1 1 6 1 and our 
scalable CP/VNS optimization methods developed in this paper. 
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APPENDIX 

A. SOURCE CODE, DATASETS 

All of our source code and experimental data can be accessed 
on our web site (http : //cs .brown. edu/~cjc/idd/l. This 
includes Java projects, CPlex/COMET models, and problem data 
files. 

Our purpose is two-fold. First, we would like to ensure the re- 
producibility of our experiments. Second, we expect this problem 
will be useful for testing various solver technologies and we want 
to make it available to the operation research community. 

B. FULL MIP MODEL 

This section provides the detailed MIP model for the ordering 
problem. The model uses the input data described in Table [2] and 
defines additional constants and variables in Table [8] Variables an- 
notated with a hat have a slightly different semantics than those in 
the CP model, but their meaning is roughly the same. The biggest 
decision variable change is that the B variables are used to deter- 
mine the orders of indexes. 



Table 8: Additional Symbols & Variables 




d€D 


A discretized timestep. D — {1, 2, . . . 


,\D\\ 


Ai&D 


Time step to start building index i. 


Bi,j e {o, 1} 


Whether index i precedes index j. 


d 


Cost to create index i. 


Xq,d 


q's runtime (not speed-up) at time d. 


Y q , P ,d € {0, 1} 


Whether p is used 

(not only available) for q at d. 


Z itd G {0, 1} 


Whether i available at time d. 


CY % , 3 € {0, 1} 


Whether j is utilized to create i. 



The index order problem can be formulated as a MIP as follows, 



Objective: 
Subject to: 



d \ q 

Btj + Bj ti — 1 : Vi 7^ j 
B l)k < B tJ + Bj, k :\fi^j^k 



Bij < 1 - 



Til + Oj A.J 

\D\ 



VMj 



^ Y q , P ,d = 1 : Vg, d 
p 

Y q ,P,d < %i,d :\/q,p £ plans(q), d,i € p 

Xq,d — 2^, {*q,p,d 

p£_plans(q) 



(qtime(q) — qspdup(p, q))} : Vg, d 
Ai + Ci~d 



< 1 



\D\ 



:Vi,d 



J2 GYi >i < ! : Vi 



CYij<B J:i :Vi,j 

Ci — ctime(i) — \_] (cspdup(i, j)CY i} j) : Vi 



(12) 

(13) 
(14) 

(15) 

(16) 

(17) 

(18) 

(19) 

(20) 

(21) 

(22) 
(23) 



< | 1 3| > assures either i precedes j or j precedes i. dT4b assures the 
index order preserves transitivity; i cannot precede k if j precedes 
i and k precedes j. The A variables determine when each index is 
made. ( | 1 5[ > means that, when i precedes j, Ai has to be d (cost 
to create i) smaller than Aj. Ai + d — Aj is divided by \D\ to 
normalizes the expression to a range between and 1 . 

The Y variables determine whether the plan is used for each 
query at d. Therefore, the sum of Y is always 1 J16) . There is 
always an empty-plan {0} which gives no speed-up to ensure fea- 
sibility of l |16| >. \\1\ assures the plan is available only when all 
indexes in the plan are available. Then, \\9) calculates the runtime 
of each query from Y. 

As constraints l |15|17[> calculate the query performance at a given 
time, constraints ( |20|23[ > calculate the query build cost at a given 
time. 1201 determines whether each index is available at each time 
step by checking A and C. pT) and J22| are equivalent to the 
constraints on Y except the interaction to build index is always 
pair-wise. {23} calculates the time to create each index from them. 

We also add the additional constraints developed in SectionBlby 
posting constraints on A and B (e.g., J3 < 25 yields, £3,5 = 1). 

The objective is simply the sum of X for all time steps, because 
we discretized the time steps uniformly. We also add an imaginary 
query plan which requires all the indexes and makes the runtimes 
of all queries zero. This ensures the objective value is for time 
steps that remain after all the queries are built. 

This MIP model correctly solves the ordering problem but in- 
troduces many constraints and variables (it requires more than 1 
million variables for large problems) due to non-linear properties 
of the problem. Because of this, MIP solvers cannot find a feasible 
solution after several hours when solving large problems. 

C. DETAILED ALGORITHM FOR INITIAL 
SOLUTIONS 

AlgorithmfTlprovides the full greedy algorithm described in Sec- 



tion [74] We developed this algorithm to provide good initial solu- 
tions to our local search methods. 

Algorithm [2] provides our implementation of the dynamic pro- 
gramming scheduling algorithm suggested in |20|. We used Stoer- 
Wagner Min-Cut |22| to divide set of indexes into two sub-clusters. 
Our definition of edge weights between nodes (indexes) is based on 
query and competing interactions. For each query, if the query has 
a plan A that has 10 seconds speed-up using index 1,2,3, 1-2, 2-3, 
1-3 have 10/3=3.3 weights. If the query has another plan B that has 
5 seconds speed-up using index 4,5, 4-5 have 5/2=2.5 weights. As 
1 and 4 speed-up the same query, we consider min(3.3, 2.5)=2.5 is 
the weight of 1-4. We sum-up these weights for all queries. We do 
not consider build interactions nor index build costs in edge weights 
because the Dynamic Programming algorithm does not consider in- 
dex build costs. 



D. FULL PROBLEM PROPERTIES 

This section provides formal proofs and detection algorithms for 
the problem properties discussed in SectionB] 

D.l Proof Preparation 

Notations: Let iV denote a complete sequence of indexes / = 
{ii,l2, ■ ■ ■ ,i n }, e.g., N — ii — > i'2 — > 13. Let L denote a 
subsequence, which is an order of a subset of the indexes, e.g., 
L\ = i\ — > £2. or L2 — 13. Let M denote an unordered set of 
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Algorithm 2: DP: Dynamic Programming Algorithm |20| 
Inputs : Index set /. Query set Q. 
Outputs: Ordered list of indexes TV. 
if \I\ = lthen 

return /; 
end 

// Cluster by Min-cut and Recurse h,h = MIN - CUT {I) ; 
N 1 =DP(I 1 ),N 2 = DP(I 2 ); 
//Merge sub-results by Interleaving 

while JVi and N 2 are not empty do 

benefit! = benefit (Q, TV U TVi.front()); 

benefit 2 = benefit (Q, TV U TV 2 .frontO); 

if benefiti > benefit 2 then 
j TV.append(TVi.pop_front()); 

else 
I TV. append(iV2 .pop_front()); 

end 
end 

TV.append(remaining TVi and TV2); 
return TV; 



Algorithm 1: Interaction Guided Greedy Algorithm 
Inputs : Index set I. Query set Q. 
Outputs: Ordered list of indexes TV. 

N={]; 

while I is not empty do 
bestDensity = 0; 
bestlndex = null; 
foreach i g I do 
benefit = 0; 
foreach q e Q do 

previous = g.getRuntime(TV); 

next = g.getRuntime(TV U i); 

benefit += previous - next; 

//Add remaining interactions to benefit 

foreach p £ plans(q) : i £ p do 

interaction = next - q.getRuntime(p); 
if interaction > and p\N ^ tj> then 

benefit += interaction / \p \ TV|; 
end 
end 
end 

density = benefit / i.getBuildCost(TV); 
if bestlndex = null or density > bestDensity then 
bestDensity = density; 
bestlndex = i; 
end 
end 

TV.append(bestlndex); 
I = I \ bestlndex; 
end 
return TV; 



indexes, e.g., Mi = {i\, i 2 } and let {L} denote the unordered set 



of indexes in L. 

Let C(i, M) be the build cost of index i when indexes in M 
are already built. Let C(L, M) be the total cost of building the 
indexes of L in the order L specifies. As an abbreviation, we 
will use C(i) = C(i, 0), e.g., L\ — ii — ¥ i 2 and C(Li) = 
C(ii) + C(i 2 , {ii}). Let S(i, M) be the query speed-up of build- 
ing i assuming the indexes of M are already built. We will also use 
the S(i) = S(i, 0) abbreviation. Because the eventual speed-up 
achieved by the indexes does not depend on the order of indexes, 
the first parameter of S can be a set of indexes unlike C. 

Let Gi be the basic area of index i. Trivially, Gi — S(i, . . )C(i, . 
To simplify the notation, let us extend G to subsequences as illus- 
trated in Figure [14] Note that the second parameter of both C and 
S is the set of indexes built before. All indexes built after have no 
effect on the value of C and S. 

Finally, let Rm be the total query runtime when indexes in M 
exist, namely Rm = R®~ S(M). For example, the total objective 
area (shaded area) in Figure [T4|is 



Obj(L a 



L b ) 



G a + R La C(L a ) 

+Gi + R La+i C{i,{L a }) 

+G b + R La +i+L b C(L b , {L a + i}) 

The Swap Property: Here we discuss a useful building block 
for the other proofs in this section. Consider the objective values 



of solutions N — L„ 



L, 



L b and TV' = L„ 



i, 



Li — > Lb which are identical except for the swap of Li and Lj. 

Obj(N) = G a + R La C(L a ) 

+G l + R La+Lz C{L l ,{L a }) 

+Gj + R La+Li+Lj C{L j ,{L a + L x \) 

+Gb + RLa+Li+Lj+LtC^Lb, {L a + Li + Lj}) 
Obj(N') = G' a +RL a C(L a ) 

+ G' ] +RL a+ L,C(L ] ,{La}) 

+G't + R La +L 1+ L t C(L,, {L a + Lj}) 

+Gb + RL a +Lj+Li+L b C(Lb, {L a + Lj + Li}) 



Because L a precedes both Li and Lj, G a = G' a - Additionally, 
because both query and build time interactions depend only on the 
set of indexes built before, Lb receives exactly the same interaction 
from indexes in L.. 

Hence, we see 



and Lj. Therefore, Gb = G' b - 



Obj(N) - Obj(N') = (Gt - G[) + (Gj - G'j) 
+R La+Li C{Li,{L a }) - R La+Lj C(Lj,{L a }) 
+R La +L l +L ] (C(Lj,{L a + L t }) ~ C{L t , {L a + Lj})) 

(24) 



Also, consider the case when a swap occurs around an interior 
order, e.g. TV = L a — > Li — > Lj — > Lk — > L b to TV' = L a — > 
Lk — > Lj — > Li — > Lb, where Li and Lk are swapped and Lj 
remains in the middle. By the same argument we can deduce, 
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L a —>i—>L h 

Figure 14: Notations 



(Gi - G'i) + {G t - G 3 ) + (G h - G' k ) 



Obj{N)-Obj{N' 
Gj 

+ RL a + LiC(Li, {La}) — Rz Ja+ L k C{Lk, {L a }) 

+RL a +L i + L j C(Lj, {L a + Li}) 

— R-La+Lk+LjCiLj^La + L k }) 

+RL a + L i +L j +L k (C(Lk, {L a + Li + Lj} 

-C(L t ,{La + L k + Lj}) 



(25) 



D.2 Alliances 

Definition : Allied indexes are a set of indexes that only appear in 
query plans as a complete group and have no external interactions 
for building cost improvements. 

Theorem 1: Every problem has at least one optimal solution N 
in which allied indexes are built consecutively. 

PROOF. Let i be the first created index among some allied in- 
dexes. Suppose a solution TV in which there is a non-empty sub 
sequence Lb between i and its allied indexes, namely TV — L a — > 
i — > Lj — > L„ where Lb contains the allied index of i. Now, con- 
sider an altered solution TV' — L a — > L, ■ — > i — >• Lb. We will 



ply means "every optimal solution should . . . ". 
subsets. In the above case {11,13}, {J2} and 
alliances with only one index, obtaining [ii. ' 
The detection overhead is 0(|P| 2 ). 



is}. We remove 
13} in the example. 



prove the objective of TV' is always smaller or the same as that of 
TV. 

Because i requires the allied indexes contained in Lb to speed 
up any query, G; = G- = and R La +i - RL a ,RL a +i+L 3 - 
RL a +Lj ■ By definition i has no interactions that speed up build- 
ing any index in Lj, therefore Gj — G'j, and C(Lj , {L a + Li}) — 
C(Lj,{L a }). Because R^+Lj < R La ,C{Li,{L a }) > C(L,.,{L a + 
Lj}), from p4), 

Obj(N) - Obj(N') 

= R La C(L,, {L a }) - RLa+LjCiLi, {L a + Lj}) 

> R La + Lj (C(Lt, {La}) - C(Li, {L a + Lj})) > 
Thus, a solution that does not create allied indexes consecutively 

can be improved by swapping so that the allied indexes come closer. 
By induction on the swapping of indexes an optimal solution can 
always contain a consecutive order of allied indexes. □ 

Detection : We detect alliances in problem instances as follows. 
First, we list all interactions as candidate alliances. Second, for 
each alliance, we look for overlaps with the other candidates. In the 
example in Figure 15] is overlaps between { i\ , 13 , is } and {12,45}. 
If there is any overlap, we break the alliances into non-overlapping 

1 If there are not multiple optimal solutions (tie), each theorem sim- 



D.3 Colonized Indexes 

Definition : An index i is called colonized by a colonizer index, 
j, iff all query plans using index i also use the colonizer, j, and the 
index has no interaction to speed up building other indexes. 

Theorem 2: Every problem has at least one optimal solution 
where every colonized index is built after its colonizer. 

PROOF. Let j be a colonized index. Suppose a solution TV in 
which there is a subsequence Lb between i and its colonizer, j, 
namely TV = L a — > i — > L 3 ■ — >• Lb where Lb contains j. Now, 
consider an altered solution N' = L a — > Lj — ► i — > Lb. With 
the same proof as alliance, the objective of N' is always smaller or 
same as that of N. Repeating this yields TV" = L a — > i — > j — > 
Lb which is no worse than all the other solutions that create indexes 
between i and j. 

Consider TV'" = L a — > j — > i — > Lb- By the same discussion, 
we show that TV'" is no worse than TV" and may even be better. 

Once again by induction on the swapping operation, any solution 
that builds a colonized index before its colonizer can be improved 
by moving the colonized index after its colonizer. □ 

Detection : The detection algorithm for colonized indexes and 
its computational cost is quite similar to that of alliances. For each 
index, we consider all the query plans it appears in and take the 
intersection (overlap) of them, which is the colonizer(s). The de- 
tection overhead is 0(|7||P|). 

D.4 Dominated Indexes 

In Section |5~3] we explained a simplified case of dominated in- 
dexes. Here we discuss dominated indexes in detail. 

Definition : Index i is dominated by index k iff all of the follow- 
ing conditions hold. WL a , Lj,j £ Lj in {25}, 

1. S(k,{L a +Lj}) >S{i,{L a + L 3 }) 

2. C(i, {L a + L 3 + k}) > C(k, {L a }) 

3. C(j, {L a + l}) > C(j, {L a + k}) 

4. S(j, {La+M+i}) < S(j, {La+M+k}) : VM G L h j £ M 

5. C{k,{La + L i }) = C(k,{L a }) 

In short, k is always more beneficial and cheaper to build than i. 
Note that these conditions are re-evaluated when some index is de- 
termined to be before or after i or k because indexes after both i and 
k are irrelevant to these conditions. At each iteration we re-evaluate 
these conditions to ensure maximum dominance detection. 

Theorem 3: An optimal solution does not build i before k. 



PROOF. Consider two solutions TV = L a 



L, 



L b and TV' = L a 



k 



Lb- In this setting, i and k 



are single indexes. Therefore, in {25}, 



G, = C{i,{La})S(i,{L a }) 

G'i = C{i,{L a + k + Lj})S{i,{L a + k + Lj}) 
G fc = C{k,{L a + i + Lj})S{k,{L a + i + Lj}) 

G' k = C(k,{La})S(k,{L a }) 

Also by the definition of S and R, 
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S(i,{L a }) + R La+l = R La 

S(i,{L a + k + Lj}) + RL a+i+ L j+ k = Rz Ja +L :j +k 

S(k,{L a + i + Lj} + R La+i+Lj+k = R La +i+L j 

S(k,{L a })+R La+Lk = R La 



Applying these to J25}, we get 

Obj(N) - Obj(N') = (G,-G;.) 

+ J R La (C(i,{L a })-C(fc,{L a })) 
+-RL a +i+L j C"(L J ,{L a + i}) 
-R La+k+Lj C(L 3 ,{L a + k}) 
+R La+i+Lj C(k, {L a + i + Lj}) 
-R La+Lj+k C(i, {L a + k + Lj}) 
Because of the condition (3) and (4), 

... > R La (C(i,{L a })-C(k,{L a })) 
+R La+l+L] C(Lj,{L a }) 
-R La+k+Lj C(Lj,{L a + k}) 
+R La+i+Lj C(k, {L a + i + Lj}) 
-R La +L i+ kC{i, {L a + k + Lj}) 

Because C(Lj,{L a + k}) < C(Lj,{L a }), 

■■■ > C(Lj,{La})(RL a +i+Lj — Rh a + k+Lj) 

+R La (C{i,{L a }) - C(k,{L a })) 
+R La+i+Lj C(k, {L a + i + Lj}) 
-R La+L]+k C(i, {L a + k + Lj}) 
Because C(i, {L a + k + Lj}) < C(i, {L a }) and the condition 

(5) (C(k, {L a + i + Lj}) = C(k, {L a })), 

... > C(Lj,{L a })(R La+i+Lj — R La+k+Lj ) 

+C{i, {L a + k + Lj}){R La - R La+Lj+k ) 

~C{k,{L a })(R La - RL a+l + L] ) 

= C(Lj,{L a })(S(k, {L a + Lj}) - S(i, {La + Lj})) 

+ C{i, {L a +k + Lj})S(Lj + k, {L a }) 

-C{k,{L a })S(Lj+i,{L a }) 
Now, by the definition of S, 

S(Lj+k,{L a }) = S(Lj,{L a })+S{k,{L a +Lj}) 
S(Lj+l,{L a }) = S(Lj,{La})+S(i,{L a +Lj}) 

From the condition (1), S(k,{L a + Lj}) > S(i, {L a + Lj})) and 
S(Lj + k, {L a }) > S(Lj + i, {L a }). Thus, 

. . . >= S(Lj + i, {L a })(C(i, {L a + k + Lj}) - C(k, {L a })) 
From the condition (2), . . . >= □ 

Detection : We find dominated indexes in the following way. For 
each index, we calculate the minimum benefit and the maximum 



creation cost to make the indexes in each query plan. Then, we 
compare its ratio of minimum benefit to maximum cost with every 
other index's ratio of maximum benefit to minimum cost. During 
this procedure, we consider the additional constraints to tighten the 
minimum/maximum. The detection overhead is 0(\I\ \P\)- 

D.5 Disjoint Indexes and Clusters 

Definition : A disjoint index is an index that has no interactions 



with other indexes. 
Let derii(M) = 



S(i,M) 



denote the density of i. Let, L a 



C(i,M) 

i — > Lb be an optimal solution. Suppose a suffix Lj of L a such 
that L a = L' a — > Lj . 

Theorem 4: Every suffix is more dense than i if i is disjoint. 
PROOF. We compare two solutions N = L' a 



and N' = L'„ 
Gi = G.;, Go 



L, 



G'j, thus from J24l 



— ► Lj — > L b 
Lb- Because i is a disjoint index, 



Obj(N) - Obj(N') 



= C(i)(R L , a+l -R La+l ) 

-C(Lj,{L' a })(R La -R La+i ) 

= C(i)S(Lj,{L' a }) - C(Lj y {L'a})S(i) 



The density of i and Lj is derii = 



5(0 



, derii — 



S({Lj},{L' a }) 



C(i)- -'" J C(Lj,{L' a }) 

Hence, Obj{N)-Obj(N') = S(i)S(Lj, {Leiden' 1 -den' 1 ) 
Therefore, if i has a larger density than any suffix, we can improve 
the solution by placing i before the suffix which contradicts the 
optimality assumption of N' . □ 

Likewise, the following theorem regarding a prefix of L b holds. 
The proof is omitted as it is symmetric. 

Theorem 5: Every prefix is less dense than i if i is disjoint. 

Let a dip be the place where we can place a disjoint index i with- 
out violating the two theorems above. Now we prove that there is 
only one dip (except when there are ties). 

Theorem 6: Every sequence has only one dip to insert a disjoint 
index i. 

PROOF. Suppose there are two or more dips. Let d 1 < d% be 
the dips. Consider the sub-sequence Lj between the places di to 
c?2. From Theoremp] Lj has a larger density than i, but from The- 
orem B] Lj has a smaller density than i. By contradiction, there 
cannot be two or more dips. D 

Now, we consider the more general cases of backward and for- 
ward disjoint. Their formal definition is as follows, 

Definition : i is backward-disjoint to k iff all interacting indexes 
of i and k succeed i or precede k. 

Definition : i is forward-disjoint to k iff all interacting indexes 
of i and k precede i or succeed k. 

Theorem 7: An optimal solution does not build k before i if i 
is backward-disjoint to k and derii > den k . 



L, 



Lb is an optimal 



Because 



PROOF. Suppose N — L a -¥ 
solution. 

Consider the interactions i and k could have with L^ 
i is backward-disjoint, none of its interacting indexes are in Lj. 
Also, none of fc's interacting indexes are in Lj either. In other 
words, i and k are disjoint indexes regarding the subsequence Lj. 

Therefore, from Theorem [4] and Theorem [5] k must be denser 
than Lj and Lj must be denser than i. However, by definition 
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dent > dent and we have a contradiction. Therefore, TV cannot 
be an optimal solution. As L a ,Lj,Lb are arbitrary, and include 
empty sets, this means an optimal solution does not build k before 

i. a 



Theorem 8: An optimal solution does not build i before k if i 
is forward-disjoint to k and derii < dent- 

This proof is omitted as it is symmetric to the previous one. 

Detection : We detect such cases as follows. For each pair of 
indexes, we check whether they are forward or backward disjoint 
to each other. If either of them is forward or backward disjoint, we 
can determine the interactions which i and k receive and calculate 
derii, dent- If the situation defined above occurs, we introduce the 
appropriate additional constraints. The overhead of this procedure 
isO(|/| 2 |P|). 

D.6 Tail Indexes 

Definition : Tail indexes are the last indexes to be built in a given 
build order L. Given some subset of indexes M £ J, we defined 
M's tail group as all solutions where tail indexes are permutations 
of M, A tail champion of M is the solution in M's tail group that 
minimizes the tail's objective. 

Theorem 9: A tail champion of M is better or same as all the 
other solutions in A/'s tail group. 

PROOF. Consider the set of preceding indexes A = I \ M and 
its order La- Let us compare the objective of TV = La — > Lm 
and N' = La — > L\ t . Suppose N' is a tail champion of M's tail 
group but TV is not. 



Obj(N) = Ga + RaC(L a ) + Gm + Ra + mC(L m ,A) 
Obj(N') = Ga + RaC(L a ) + G' m + R a+ mC{L' m ,A) 

Now, because TV' and TV are in the same tail group and TV' is the tail 

champion, G M + Ra+mC{L m , A) > G' M + Ra+mC(L' m , A) 
Therefore, Obj(N) - Obj(N') > for every possible L A - □ 

Let F — {Mi, M%, . . .} be the set of all possible tail groups in 
the problem. Let Const be a rule that holds in all tail champions 
of Me F. 

Theorem 10: Const holds in the optimal solution. 

PROOF. From Theorem [5] the only possible optimal solution 
from M's tail group is the tail champion. Because F is a com- 
prehensive set of all possible tail groups, the optimal solution is 
one of the tail champions. 

Thus, regardless which tail group the optimal solution appears 
in, Const holds in the optimal solution. □ 



This theorem proves the property used in Section [575] We note 
that Const can be any kind of rule. For example, "i% appears as 
the last index", "«2 is built after ti", "13 never appears in the last 3 
indexes". 

Detection : At the end of each problem analysis iteration, we 
apply the tail analysis. We start from the tail length of 3 and in- 
crease the tail length until the number of tail candidates exceeds 
the threshold k. For each tail candidate, we calculate the tail ob- 
jective and group them by the set of tail indexes as explained in 
Section [53] The detection overhead is obviously 0{k), thus k is a 
tuning parameter balancing on the pruning power and the overhead 
of pre-analysis. In our experiments, we used k — 50000. 
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